Abstract. VO-Neural is the natural evolution of the AstroNeural project, which was started
in 1994 with the aim of implementing a suite of neural tools for data mining in massive
astronomical data sets. Unlike its ancestor, which was implemented under Matlab,
VO-Neural is written in C++, is object oriented, and is specifically tailored to work in
distributed computing architectures. We discuss the current implementation status of
VO-Neural, present an application to the classification of Active Galactic Nuclei, and
outline the ongoing work to improve the functionalities of the package.
1. Introduction

One of the main goals of the International Virtual Observatory (VOb) is
the federation under common standards of all astronomical archives
available worldwide (URL.3). Once this meta-archive is completed, its
exploitation will allow a new type of multi-wavelength, multi-epoch
science which can only barely be imagined (Djorgovski) but will also
pose unprecedented computing problems. From a mathematical point of
view, in fact, most of the operations performed by astronomers in their
everyday work can be reduced (either consciously or unconsciously) to
standard data mining tasks such as clustering, classification, pattern
recognition and trend analysis. All these tasks scale very badly when
either the number N of records to be processed or the number D of
features characterizing each record increases:

- clustering scales between $\sim N \log N$ and $\sim N^2$, and as
  $\sim D^2$;

- search for correlations scales between $\sim N \log N$ and
  $\sim N^2$, and as $\sim D^k$ with $k > 1$;

- Bayesian or likelihood algorithms scale as $\sim N^m$ with $m > 3$
  and as $\sim D^k$ with $k > 1$.
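
To make these scalings concrete, the following Python sketch evaluates leading-order operation counts for two catalogue sizes. It is only an order-of-magnitude illustration under stated assumptions: the function name `operation_counts` is hypothetical, all constant factors are ignored, clustering and correlation searches are assumed to cost between N log N and N^2 operations, and m = 3 is used as the limiting exponent for Bayesian or likelihood algorithms.

```python
import math


def operation_counts(n_records: int, m: int = 3) -> dict:
    """Leading-order operation counts (constants ignored) for n_records."""
    n = float(n_records)
    return {
        "n_log_n": n * math.log2(n),  # best case for clustering / correlations
        "n_squared": n ** 2,          # worst case for clustering / correlations
        "n_power_m": n ** m,          # Bayesian / likelihood, limiting exponent m
    }


if __name__ == "__main__":
    for n_records in (10**6, 10**9):
        print(f"N = {n_records:.0e}")
        for name, value in operation_counts(n_records).items():
            print(f"  {name:10s} ~ {value:.1e} operations")
```

For N = 10^9 the difference between the N log N and the N^2 regimes is already more than seven orders of magnitude, before the dependence on D is even taken into account.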

To get an idea of the computational demands posed by the VOb, we just
note that a modern digital survey can easily produce datasets with
$N \sim 10^9$ and $D \gg 10^2$, and leave it to the reader to imagine
the demands of a multi-wavelength, multi-epoch survey. It is apparent
that the extraction of knowledge from such data sets cannot be
performed with traditional software (URL.2) and hardware, and requires
some form of high performance computing (HPC). The traditional HPC
approach, based on parallel multi-CPU software running on dedicated
clusters, is however against the very philosophy of the VOb, which aims
at opening the exploitation of its data archives also to scientists who
do not have access to large HPC centers. In this respect, the GRID
seems to offer the most natural and democratic answer since, at least
in theory, it allows any user possessing a personal certificate to
access the distributed computing resources. The VOb, however, precisely
because it is open to use by the community at large, does not match the
security requirements of the GRID, and this limitation strongly
undermines its effectiveness.

In (Deniskina et al. 2008) we discuss the first version of
GRID-Launcher, a tool which interfaces the UK-ASTROGRID (URL.1) with
the GRID-SCOPE (URL.6). In this contribution we discuss instead the
structure of the data mining package VO-Neural (URL.7), which is
specifically designed to perform complex data mining (DM) tasks on
massive data sets (MDS), astronomical and otherwise. As an example, in
Sect. 3 we also show how the methods implemented so far can be used to
address the challenging task of obtaining an objective classification
of Active Galactic Nuclei (AGN). Finally, in the last Section we
briefly outline some ongoing and planned developments.

2. VO-Neural 

VO-Neural is a data mining framework whose goal is to provide the
astronomical community with powerful software instruments capable of
working on massive (> 1 TB) data sets (catalogues) in a distributed
computing environment matching the IVOA standards and requirements.
VO-Neural is the evolution of the AstroNeural (Tagliaferri et al. 2003)
project, which was started in 1994 as a collaboration between the
Department of Mathematics and Applications at the University of Salerno
and the Astronomical Observatory of Capodimonte-INAF, and is currently
under continuous evolution. VO-Neural allows the user to extract from
large datasets information useful to determine patterns, relationships,
similarities and regularities in the space of parameters, and to
identify outliers. In its final version, it will offer main processing
features such as exploratory data analysis and data prediction, and
ancillary functionality such as fine tuning, visual exploration of the
main characteristics of the datasets, etc. Besides offering the
possibility to use the individual routines to perform specific tasks,
VO-Neural will provide users with a complete framework to write their
own customized programs.

Without entering into too many details, we just recall that, in our
view, data exploration means agglomerative clustering and dimensional
reduction of the parameter space; data prediction means prediction,
classification and regression; fine tuning means the determination of
Not a Number (NaN) values, upper limits and outliers, catalogue
statistical analysis and data extraction.

With reference to Fig. 1, we specify that deterministic, self-adaptive
and statistical methods are implemented to meet the above functional
requirements, embedded in a generic pipeline. Deterministic models
include triggers and data reduction algorithms. Self-adaptive models
are organized in supervised and unsupervised tools. Statistical models
refer to simple statistical functions, either Bayesian or non-Bayesian,
while dimensional reduction and clustering rely on methods like
Probabilistic Principal Surfaces (PPS) and Negative Entropy Clustering
(NEC). Classification includes self-adaptive models like supervised
neural networks (MLP with back-propagation and genetic algorithms,
C-SVC and NU-SVC) and, finally, regression refers to the Multi Layer
Perceptron, to other supervised self-adaptive models like EPSILON-SVR
and NU-SVR, and to deterministic data fitting algorithms. Moreover, a
set of graphical analysis tools (such as histograms, whisker and bar
plots, etc.) is included.

VO-Neural is built around the following standards:

- XP-agile as the suite design method;

- UML (Unified Modeling Language);

- OOP (Object Oriented Programming);

- interface protocols based on EGEE, VO & AstroGrid paradigms;

- standard I/O interface methods for software systems integrity;

- SVN (Subversion) for software version control & archiving;

- webservice-based user interfaces.

In the next two subsections we briefly outline the main features of two
supervised models already included in the package, which have already
been used for specific science applications.

2.1. VONeural_MLP

VONeural_MLP is an implementation of a standard Multi Layer Perceptron
based on the FANN (Fast Artificial Neural Networks) Library (URL.4),
written in C (Skordovski 2008), and tailored to be launched as a web
service from the ASTROGRID Workbench. The Multi Layer Perceptron (MLP)
is based on the concept of the perceptron, and its learning method is
gradient descent, which finds a local minimum of a function in a space
with N dimensions. The weights associated with the connections between
the layers of neurons are initialized at small random values, and then
the MLP applies the learning rule using the template patterns.
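
The learning scheme just described can be sketched in a few lines of plain Python. This is a minimal illustration and not the FANN-based code of the package: the class `TinyMLP`, the choice of a single hidden layer with sigmoid units, the squared-error function, the learning rate and the toy template patterns (the logical OR) are all assumptions made only for the sake of the example.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One-hidden-layer perceptron with a single sigmoid output (illustrative only)."""

    def __init__(self, n_inputs: int, n_hidden: int, seed: int = 0):
        rng = random.Random(seed)
        # Weights start at small random values; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        output = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def train_pattern(self, x, target, rate):
        """One gradient-descent step on the squared error of a single pattern."""
        hidden, output = self.forward(x)
        delta_out = (output - target) * output * (1.0 - output)
        # Back-propagate the output error to the hidden layer before updating w_out.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, v in enumerate(hidden + [1.0]):
            self.w_out[j] -= rate * delta_out * v
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x + [1.0]):
                row[i] -= rate * delta_hidden[j] * v

    def loss(self, patterns):
        return sum(0.5 * (self.forward(x)[1] - t) ** 2 for x, t in patterns)


# Toy template patterns: the logical OR of two inputs.
PATTERNS = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]


def train(epochs: int = 2000, rate: float = 0.5):
    net = TinyMLP(n_inputs=2, n_hidden=3)
    initial = net.loss(PATTERNS)
    for _ in range(epochs):
        for x, target in PATTERNS:
            net.train_pattern(x, target, rate)
    return net, initial, net.loss(PATTERNS)


if __name__ == "__main__":
    net, initial, final = train()
    print(f"loss before training: {initial:.4f}, after: {final:.4f}")
```

Since gradient descent only reaches a local minimum, the final weights depend on the random initialization; in practice several initializations are usually compared.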

2.2. VONeural_SVM

VONeural_SVM is an implementation of the Support Vector Machines (Russo
2007; Cavuoti 2008) based on the LIBSVM library (URL.5). Support Vector
Machines classify records by first mapping the data into a space of
higher dimensionality and then using a set of template vectors
(targets) to find in this new space a separating hyperplane with the
largest possible margin. Without entering into details (which can be
found in Boser et al., 1992; Cortes and Vapnik, 1995), we just recall
that, in the case of the C-SVC implemented with the RBF (Radial Basis
Functions) kernel, the position of this hyperplane depends on two
parameters ($C$ and $\gamma$) which cannot be estimated in advance but
need to be evaluated by searching for the maximum over a grid of
values, which is usually defined by letting $C$ and $\gamma$ vary as
$C = 2^{-5}, 2^{-3}, \ldots, 2^{15}$ and
$\gamma = 2^{-15}, 2^{-13}, \ldots, 2^{3}$. Due to their computational
weight, and to the need to run many iterations for different pairs of
the two parameters, SVMs are ideally suited for the GRID.
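
Since each pair of parameters can be evaluated independently of all the others, the grid search is trivially parallel. The sketch below builds the grid and splits it among workers; it assumes the power-of-two grid commonly used with LIBSVM (exponents from -5 to 15 for C and from -15 to 3 for gamma, in steps of 2), and the helper names `parameter_grid`, `split_among_workers` and `best_pair` are hypothetical. The actual SVM training for each pair is deliberately left out.

```python
from itertools import product


def parameter_grid(c_exponents=range(-5, 16, 2), gamma_exponents=range(-15, 4, 2)):
    """Return every (C, gamma) pair of the assumed power-of-two grid."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exponents, gamma_exponents)]


def split_among_workers(pairs, n_workers):
    """Round-robin assignment of parameter pairs to independent workers."""
    return [pairs[i::n_workers] for i in range(n_workers)]


def best_pair(scores):
    """Pick the (C, gamma) pair with the highest score from a {pair: score} dict."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    grid = parameter_grid()
    print(len(grid), "parameter pairs")
    jobs = split_among_workers(grid, n_workers=4)
    print("pairs per worker:", [len(job) for job in jobs])
```

With these exponents the grid contains 11 x 10 = 110 pairs, each of which requires a full training run; once every worker has returned a score for its pairs, `best_pair` selects the one to be used for the final classifier.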

3. The classification of AGN 

The astronomical community is used to performing DM tasks in a sort of
"hidden" way (cf. the selection of specific objects in a color-color
diagram), but it has not yet become familiar with the potential of more
advanced tools such as those described here. This is mainly due to the
fact that these tools are often anything but user friendly and require
an in-depth understanding of the (often complex) theory lying behind
them, a complexity which often discourages potential users. Therefore,
a crucial aspect of the project is the application to challenging
problems which can better exemplify the new science that will emerge
from the adoption of a less conservative approach to the analysis of
the data. Two science cases, namely the evaluation of photometric
redshifts (a regression and classification problem based on the use of
VONeural_MLP) and the selection of candidate quasars in the Sloan
Digital Sky Survey (cf. Stoughton et al. 2002), based on the use of
unsupervised clustering algorithms and agglomerative clustering, have
already been published in the literature (D'Abrusco 2007; D'Abrusco et
al. 2007; D'Abrusco et al. 2008). We shall therefore focus on a new
application of VONeural_SVM to the classification of AGNs.

The classification of AGN is usually performed on their overall
spectral distribution using some spectroscopic indicators (equivalent
line widths, FWHM of specific lines or line flux ratios) and diagnostic
diagrams (usually called BPT). In these diagrams AGN and non-AGN are
empirically separated by lines derived either from theory or from
empirical laws such as those derived by (Kewley et al. 2001; Kauffman
et al. 2003; Heckman 1980). A reliable and accurate AGN classifier
based on photometric features only would save precious telescope time
and enable several studies based on statistically significant samples
of objects. We therefore used a supervised classification approach
based on a Base of Knowledge (BoK) derived from the available
catalogues. We wish to stress that, since neural networks have no power
of extrapolation, all the biases in the BoK will be reproduced in the
final results. As classification tools we used the MLP and, due to the
intrinsically binary nature of the problem (AGN against non-AGN,
Seyfert 1 against Seyfert 2, etc.), also the SVM. The BoK was obtained
from the fusion of two catalogues:

- the catalogue of (Sorrentino et al. 2006), who separated objects into
  Seyfert 1, Seyfert 2 and "Not AGN" using Kewley's lines (Kewley et
  al. 2001);

- a catalogue derived by us from the SDSS spectroscopic archive using
  the criteria introduced by (Kauffman et al. 2003), in which objects
  are classified as AGN, not AGN, and "mixed". The Mix and Pure AGN
  zones were further divided into Seyferts and LINERs by using the
  Heckman line (Heckman 1980).

experiment        BoK                                algorithm  efficiency  completeness
----------------  ---------------------------------  ---------  ----------  ------------
AGN vs Mix        BPT plot + Kewley line             MLP        76%         54%
                  BPT plot + Kewley line             SVM        74%         55%
Type 1 vs 2       BPT plot + Kewley line             MLP        95%         ~100%
                  BPT plot + Kewley line             SVM        82%         98%
Seyfert vs LINER  BPT plot + Heckman & Kewley lines  MLP        80%         92%
                  BPT plot + Kewley line             SVM        78%         89%

Table 1. Summary of the results of supervised classification experiments performed using both
VONeural_MLP and VONeural_SVM.

We performed three types of classification experiments (AGN vs Mix,
Type 1 vs Type 2, Seyfert vs LINER) using both the MLP and the SVM. For
all of them we used the same set of features (for a definition refer to
the SDSS specifications) extracted from the SDSS database: petroR50_u,
petroR50_g, petroR50_r, petroR50_i, petroR50_z, concentration_index_r,
fibermag_r, (u - g)dered, (g - r)dered, (r - i)dered, (i - z)dered,
dered_r, together with the photometric redshift in (D'Abrusco et al.
2007). The experiments with SVM were performed on the GRID-SCOPE using
110 worker nodes. The results are summarized in Table 1.

As can be seen, the use of machine learning tools allows one to reach
performances which in some cases (e.g. Type 1 vs 2 with the MLP) cannot
by any means be achieved with more traditional tools. A more detailed
discussion of the results will be presented in (Cavuoti, D'Abrusco &
Longo, 2008, in preparation).
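
The two figures of merit quoted in Table 1 are not defined explicitly in the text. The sketch below shows how they could be computed under the assumption, common in the astronomical literature, that the efficiency of a class is the fraction of the objects assigned to it which truly belong to it, and the completeness is the fraction of its true members which are recovered. The function name and the labels in the example are made up for illustration and are not taken from the experiments.

```python
def efficiency_and_completeness(true_labels, predicted_labels, target_class):
    """Return (efficiency, completeness) for target_class.

    Assumed definitions (not stated in the text):
      efficiency   = correctly selected objects / all objects selected as target_class
      completeness = correctly selected objects / all true members of target_class
    """
    pairs = list(zip(true_labels, predicted_labels))
    selected = sum(1 for _, p in pairs if p == target_class)
    members = sum(1 for t, _ in pairs if t == target_class)
    correct = sum(1 for t, p in pairs if t == p == target_class)
    efficiency = correct / selected if selected else float("nan")
    completeness = correct / members if members else float("nan")
    return efficiency, completeness


if __name__ == "__main__":
    # Made-up labels, purely for illustration.
    truth = ["AGN", "AGN", "AGN", "Mix", "Mix", "AGN"]
    guess = ["AGN", "Mix", "AGN", "AGN", "Mix", "AGN"]
    eff, comp = efficiency_and_completeness(truth, guess, "AGN")
    print(f"efficiency = {eff:.2f}, completeness = {comp:.2f}")
```

Under these definitions a classifier can be highly efficient and still incomplete, which is why both numbers are needed to compare the experiments.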

4. Future developments 

The ongoing work is focused on three main aspects: i) implementing
better methods through an extensive parallelization of the already
existing codes; ii) improving the interfacing of the package with the
GRID; iii) incorporating within the VO-Neural package tools capable of
extracting information from the data collected by the new generation of
astroparticle physics experiments.